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Abstract. This paper presents a new stochastic preconditioning approach. For symmetric 
diagonally-dominant M-matrices, we prove that an incomplete LDL factorization can be obtained 
from random walks, and used as a preconditioner for an iterative solver, e.g., conjugate gradient. 
It is argued that our factor matrices have better quality, i.e., better accuracy-size tradeoffs, than 
preconditioners produced by existing incomplete factorization methods. Therefore the resulting 
preconditioned conjugate gradient (PCG) method requires less computation than traditional PCG 
methods to solve a set of linear equations with the same error tolerance, and the advantage increases 
for larger and denser sets of linear equations. These claims are verified by numerical tests, and we 
provide techniques that can potentially extend the theory to more general types of matrices. 

1. Introduction. Preconditioning is a crucial part of an iterative linear equa- 
tion solver. Suppose a set of linear equations is Ax. — b, where ^ is a given square 
nonsingular matrix that is large and sparse, b is a given vector, and x is the unknown 
solution vector to be computed. A (multiplicative) preconditioner is a square nonsin- 
gular matrix T such that an iterative solver can solve the transformed linear system^ 
Tj4x ~ Th with a higher convergence rate. 

The quality of a preconditioner matrix T is how closely it approximates^ A~^. It 
is important to note that a preconditioned iterative solver only requires the evaluation 
of the product of T and a vector, and does not require an explicit representation of T 
in the form of a matrix. Consequently, any procedure that solves the system Ax. — v 
approximately can be viewed as a preconditioner, and the resulting approximate so- 
lution can be viewed as the needed product Tv, where v is any given vector. Existing 
preconditioning techniques can be roughly divided into two categories: explicit meth- 
ods and implicit methods In explicit preconditioning methods, which are often 
referred to as approximate inverse methods, the preconditioner T is in the form of a 
matrix, a product of matrices, or a polynomial of matrices, and therefore for any given 
vector V, the product Tv can be evaluated by matrix- vector multiplications [S1E3I- 
In implicit preconditioning methods, the preconditioner T is in the form of (A) , 
where A approximates A and is easier to solve, and therefore for any given vector 
V, the product Tv is evaluated by solving a linear system with the left-hand-side 
matrix A |S|[2S|- Although explicit preconditioning methods have the advantage of 
being easily parallelizable, implicit methods have been more successfully developed 
and more widely used. A prominent class of implicit preconditioners are those based 
on incomplete LU (ILU) factorization; for example, ILU(O), ILU(k) and ILUT are 
popular choices in numerical computation'^ P] |23| . 
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^This transformation uses left preconditioning, as against other options such as right precondi- 
tioning and split preconditioning 1231 . For simplicity, only left preconditioning is discussed in this 
paper; however, the incomplete factorization produced by the proposed approach can be easily used 
as a right or split preconditioner. 

^This can be measured by the spectral radius of the matrix (/ — TA), or by the condition number 
of the matrix TA. It is often difficult to analytically quantify either of the two values for a general 
matrix, and the discussion of preconditioner accuracy is mostly qualitative. 

^When matrix A is symmetric and positive definite, the ILU factors become the corresponding 
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For clarity of the presentation, most of the discussion in this paper is limited to 
a special class of left-hand-side matrices: matrix A is referred to as an R-matrix if 
it is a symmetric M- matrix and is irreducibly diagonally dominant. One exception is 
Section [7| which is dedicated to extending the theory to more general matrix types. 

For R-matrices A, the most widely used iterative solver is the Incomplete Cholesky 
factorization preconditioned Conjugate Gradient (ICCG) method 0^1] [221, which 
uses T — {BB^^ as the preconditioner, where B is an incomplete Cholesky factor of 
A. There are various existing techniques to produce B for ICCG. All of these perform 
Gaussian elimination on A, and use a specific strategy to drop insignificant entries 
during the process: ILU(O) applies a pattern-based strategy, and allows Bi^j ^ only 
if Ai,j 7^ 23 ; ILUT applies a value-based strategy, and drops an entry from B if its 
value is below a threshold, which is typically determined by multiplying the norm of 
the corresponding row of A by a small constant J23| ; a more advanced strategy can be 
a combination of pattern, threshold and other size limits such as maximum number 
of entries per row. 

Our proposed preconditioning technique belongs to the category of multiplica- 
tive implicit preconditioners based on incomplete factorization, and our innovation 
is a stochastic procedure for building the incomplete triangular factors. It is argued 
theoretically that our factor matrices have better quality, i.e., better accuracy-size 
tradeoffs, than preconditioners produced by existing incomplete factorization meth- 
ods. Therefore the resulting preconditioned conjugate gradient (PCG) method, which 
we refer to as the hybrid solver^ requires less computation than traditional PCG meth- 
ods to solve a set of linear equations with the same error tolerance, and the advantage 
increases for larger and denser sets of linear equations. We use numerical tests to com- 
pare our solver against both ICCG with ILU(O) and ICCG with ILUT, and provide 
techniques that can potentially extend the theory to more general types of matrices. 
Parts of this paper were initially published in [201 HD • 

We will now review previous efforts of using stochastic methods to solve systems 
of linear equations. Historically, the theory was developed on two seemingly inde- 
pendent tracks, related to the analysis of potential theory [S] U^j C3 CHI El CD and 
to the solution of systems of linear equations (8, ,_26^ J7| [^Hj ■ However, the two 
applications are closely related and research along each of these tracks has resulted 
in the development of analogous algorithms, some of which are equivalent. 

The second of these parallel tracks will be discussed here. The first work that 
proposed a random-walk based linear equation solver is :8i , although it was presented 
as a solitaire game of drawing balls from urns. It was proven in 8 that, if matrix 
A satisfies certain conditions, a game can be constructed and a random variable^ X 
can be defined such that E[X] — {A~^)ij, where {A~'^)ij is an entry of the inverse 
matrix of A. Two years later, the work in |28j continued this discussion in the for- 
mulation of random walks, and proposed the use of another random variable, and it 
was argued that, in certain special cases, this variable has a lower variance than X, 
and hence is likely to converge faster. Both "S* and "SS" have the advantage of being 
able to compute part of an inverse matrix without solving the whole system, in other 
words, localizing computation. Over the years, various descendant stochastic solvers 
have been developed 17 "5^1 [27], though some of them, e.g., [201 [23 j do not have the 
property of localizing computation. 

Early stochastic solvers suffer from accuracy limitations, and this was remedied 



incomplete Cholesky factors, and they are denoted with the same symbols in this paper. 
*The notations are different from the original ones used in 
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by the sequential Monte Carlo method proposed in jH] and ^|. Let x' be an ap- 
proximate solution to Ax — b found by a stochastic solver, let the residual vector be 
r = b — Ax.' , and let the error vector be z = x — x'; then Az = r. The idea of the 
sequential Monte Carlo method is to iteratively solve this system of equations using a 
stochastic solver, and in each iteration, to compute an approximate error vector z that 
is then used to correct the current solution x'. Although the sequential Monte Carlo 
method has existed for over forty years, it has not resulted in any powerful solver that 
can compete with direct and iterative solvers, due to the fact that random walks are 
needed in every iteration, resulting in a relatively high overall time complexity. 

2. Stochastic Linear Equation Solver. In this section, we study the un- 
derlying stochastic mechanism of the proposed preconditioner. It is presented as a 
stand-alone stochastic linear equation solver; however, in later sections, its usage is 
not to solve equations, but to build an incomplete factorization. 




2.1. The Generic Algorithm. Let us consider a random walk "game" defined 
on a finite undirected connected graph representing a street map, for example, Fig- 
ure 12.11 A walker starts from one of the nodes, and every day, he/she goes to an 
adjacent node I with probability pi i for I = 1, 2, • • • , degree(i), where i is the current 
node, degree(i) is the number of edges connected to node z, and the adjacent nodes 
are labeled 1, 2, • • • , degree(j). The transition probabilities satisfy 

dcgrcc('t) 

(2.1) E p^^i = ^ 

1=1 

The walker pays an amount rrii to a motel for lodging everyday, until he/she reaches 
one of the homes, which are a subset of the nodes. Note that the motel price rrii is a 
function of his/her current location, node i. The game ends when the walker reaches 
a home node; he/she stays there and gets awarded a certain amount of money, rriQ. 
We now consider the problem of calculating the expected amount of money that the 
walker has accumulated at the end of the walk, as a function of the starting node, 
assuming he/she starts with nothing. The gain function is therefore defined as 

(2.2) f{i) — i?[total money earned |walk starts at node i] 
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It is obvious that 

(2.3) /(one of the homes) = tjiq 

For a non-home node i, again assuming that the nodes adjacent to i are labeled 1,2, 
• • • , degree(i), the / variables satisfy 

degree(«) 

(2.4) /(z)= 

(=1 

For a random walk game with N non-home nodes, there are N linear equations similar 
to the one above, and the solution to this set of equations will give the exact values 
of / at all nodes. 

In the above equations obtained from a random walk game, the set of allowable 
left-hand-side matrices is a superset of the set of R- matrices'"^. In other words, given 
a set of linear equations Ax. = b, where A is an R-matrix, we can always construct a 
random walk game that is mathematically equivalent, i.e., such that the / values are 
the desired solution x. To do so, we divide the i'^ equation by Ai^i to obtain 

(2.5) ^^+E4^-.-^ 



Equation (|2.4|l and equation (|2.6(l have seemingly parallel structures. Let N be the 
dimension of matrix A^ and let us construct a random walk game with N non-home 
nodes, which are labeled 1, 2, • • • , iV. Due to the properties of an R-matrix, we have 

• ^— is a non-negative value and can be interpreted as the transition 
probability of going from node i to node j. 

• {^~-^^ can be interpreted as the motel price rrii at node i. 

However, the above mapping is insufficient due to the fact that condition (|2.1|) may 
be broken: the sum of the ^— x^) coefficients is not necessarily one. In fact, because 

all rows of matrix A are diagonally dominant, the sum of the ^— coefficients is 

always less than or equal to one. Condition (|2.1() can be satisfied if we add an extra 
transition probability of going from node i to a home node, by rewriting H2.6I) as 



(2.7) where 6- = - ^ A^j ■ mo 



It is easy to verify that — ^— — is a non-negative value for an R-matrix, and that the 
following mapping establishes the equivalence between equation H2.4II and equation 
while satisfying and 



^A left-hand-side matrix from a random walk game has all the properties of an R-matrix, except 
that it is not necessarily symmetric. 
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• is the transition probability of going from node i to node j. 

• — ^— — is the transition probabihty of going from node i to a home node 
with award ttiq. 

• (^~-^^^ is the motel price rrii at node i. 

The choice of toq is arbitrary because b'^ always compensates for the ttiq term in 
equation H2.7I) . and in fact mo can take different values in (|2.7|l for different rows 
i. Therefore the mapping from an equation set to a game is not unique. A simple 
scheme can be to let mo = 0, and then = 

To find Xi , the i**^ entry of solution vector x, a natural way is to simulate a certain 
number of random walks from node i and use the average monetary gain in these walks 
as the approximated entry value. If this amount is averaged over a sufficiently large 
number of walks by playing the "game" a sufficiently large number of times, then by 
the Law of Large Numbers acceptably accurate solution can be obtained. 

According to the Central Limit Theorem [221 , the estimation error of the above 
procedure is asymptotically a zero-mean Gaussian variable with variance inversely 
proportional to M, where M is the number of walks. Thus there is an accuracy- 
runtime tradeoff. In implementation, instead of fixing M, one may employ a stopping 
criterion driven by a user-specified error margin A and confidence level a: 

(2.8) P[-A<x',-x^<A]>a 

where x^ is the estimated i*"^ solution entry from M walks. 

2.2. Two Speedup Techniques. In this section, we propose two new tech- 
niques that dramatically improve the performance of the stochastic solver. They will 
play a crucial role in the proposed preconditioning technique. 

2.2.1. Creating Homes. As discussed in the previous section, a single entry in 
the solution vector x can be evaluated by running random walks from its corresponding 
node in the game. To find the complete solution x, a straightforward way is to repeat 
such procedure for every entry. This, however, is not the most efficient approach, 
since much information can be shared between random walks. 

We propose a speedup technique by adding the following rule: after the computa- 
tion of Xi is finished according to criterion (|2.8|l . node i becomes a new home node in 
the game with an award amount equal to the estimated value x[. In other words, any 
later random walk that reaches node i terminates, and is rewarded a money amount 
equal to the assigned x'^. Without loss of generality, suppose the nodes are processed 
in the natural ordering 1,2,- • •,A^, then for walks starting from node k, the node 
set {l,2,---,fc— 1} are homes where the walks terminate (in addition to the original 
homes generated from the strictly-diagonally-dominant rows of A), while the node set 
{fc. A: -|- 1, • • • , N} are motels where the walks pass by. 

One way to interpret this technique is by the following observation about (|2.4|l : 
there is no distinction between the neighboring nodes that are homes and the neigh- 
boring nodes that are motels, and the only reason that a walk can terminate at a home 
node is that its / value is known and is equal to the award. In fact, any node can be 
converted to a home node if we know its / value and assign the award accordingly. 
Our new rule is simply utilizing the estimated x'^ ~ Xi in such a conversion. 

Another way to interpret this technique is by looking at the source of the value 
x[. Each walk that ends at a new home and obtains such an award is equivalent to 
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an average of multiple walks, each of which continues walking from there according 
to the original game settings. 

With this new method, as the computation for the complete solution x proceeds, 
more and more new home nodes are created in the game. This speeds up the algorithm 
dramatically, as walks from later nodes are carried out in a game with a larger and 
larger number of homes, and the average number of steps in each walk is reduced. 
At the same time, this method helps convergence without increasing A/, because, as 
mentioned earlier, each walk becomes the average of multiple walks. The only cost^ 
is that the game becomes slightly biased when a new home node is created, due to 
the fact that the assigned award value is only an estimate, e.g. x'^ ^ Xi\ overall, the 
benefit of this technique dominates its cost. 

2.2.2. Bookkeeping. For the same left-hand-side matrix A, traditional direct 
linear equation solvers are efficient in computing solutions for multiple right-hand-side 
vectors after initial matrix factorization, since only a forward/backward substitution 
step is required for each additional solve. Analogous to a direct solver, we propose a 
speedup mechanism for the stochastic linear equation solver. 

The mechanism is a bookkeeping technique based on the following observation. 
In the procedure of constructing a random walk game discussed in Section ITTl the 
topology of the game and the transition probabilities are solely determined by matrix 
A, and hence do not change when the right-hand-side vector b changes. Only motel 
prices and award values in the game are linked to b. 

When solving a set of linear equations with matrix A for the first time, we create 
a journey record for every node in the game, listing the following information. 

• For any node i, record the number of walks performed from node i. 

• For any node i and any motel node'' j, record the number of times that walks 
from node i visit node j. 

• For any node i and any home node^ j, which can be either an initial home 
node in the original game or a new home node created by the technique from 
Section I2. 2. II record the number of walks that start from i and end at j. 

Then, if the right-hand-side vector b changes while the left-hand-side matrix A 
remains the same, we do not need to perform random walks again. Instead, we simply 
use the journey record repeatedly and assume that the walker takes the same routes, 
gets awards at the same locations, pays for the same motels, and only the award 
amounts and motel prices have been modified. Thus, after a journey record is created, 
new solutions can be computed by some multiplications and additions efficiently. 

Practically, this bookkeeping is only feasible after the technique from Section l^.2.11 
is in use, for otherwise the space complexity can be prohibitive for a large matrix. 

In the next section, this bookkeeping technique serves as an important basis of 
the proposed preconditioner. There the bookkeeping scheme itself is modified in such 
a way that a rigorous proof is presented in Section|^21showing the fact that the space 
complexity of the modified bookkeeping is upper-bounded by the space complexity of 
the matrix factorization in a direct solver. 

3. Proof of Incomplete Factorization. In this section, we build an incomplete 
LDL factorization of an R-matrix A by extracting information from the journey record 

^The cost discussed here is in the context of the stochastic solver only. For the proposed precon- 
ditioner, this will no longer be an issue. 

^The journey record is stored in a sparse fashion, and a motel j is included only if walks from 
node i visit j at least once. 

*A home node j is included in the journey record only if at least one walk from i ends at j. 



STOCHASTIC PRECONDITIONING 



7 



of random walks. The proof is described in two stages: Section IXTl proves that the 
journey record contains an approximate L factor, and then Section \'6 . 21 proves that its 
non-zero pattern is a subset of that of the exact L factor. The formula of the diagonal 
D factor is derived in Section IXSl 

The procedure of finding this factorization is independent of the right-hand-side 
vector b. Any appearance of b in this section is symbolic: its entries do not participate 
in actual computation, and the involved equations are true for any possible b. 

3.1. The Approximate Factorization. Suppose the dimension of matrix A is 
iV, and its fc'^ row corresponds to node k in Figure ITTl k = 1, 2, • • • , A^. Without loss 
of generality, assume that in the stochastic solution, the nodes are processed in the 
natural ordering 1, 2, • • • , A^. According to the speedup technique in Section 2. II for 
random walks that start from node k, the nodes in the set {1,2,- • • , A: — 1} are already 
solved and they now serve as home nodes where a random walk ends. The awards 
for reaching nodes {1,2,- ••,/c— 1} are the estimated values of {xi,X2t - • ■ ,Xk-i} 
respectively. Suppose that in equation 12. 7() . we choose toq — 0, and hence the motel 
prices are given by = —^^^ for i = k,k -\- 1, - ■ ■ ,N . Further, 

• Let Mfc be the number of walks carried out from node k. 

• Let Hk^i be the number of walks that start from node k and end at node 
iG{l,2,---,fc-l}. 

• Let Jfc^j be the number of times that walks from node k pass the motel at 
node z e {/c, fc + 1, • • • , A^}. 

Taking the average of the results of the walks from node fc, we obtain the 
following equation for the estimated solution entry. 

(3-1) Xk = ^7 

where x- is the estimated value of for i G {1, 2, • • • , /c — 1}. Note that the awards 
received at the initial home nodes are ignored in the above equation since toq — 0. 
Moving the Hk^i terms to the left side, we obtain 

2=1 i=k 

By writing the above equation for k — 1, 2, • • • , A^, and assembling the N equations 
together into a matrix form, we obtain 

(3.3) Fx' ^ Zh 

where x' is the approximate solution produced by the stochastic solver; Y and Z are 
two square matrices of dimension A^ such that 

Yk,k = 1, VA: 
Yk,, = 0, Vfc < z 

(3.4) Zfc = 0, yk>i 



Yk,i = TT^, Vfc > I 
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These two matrices Y and Z are the journey record built by the bookkeeping 
technique in Section !^. 2. 21 Obviously F is a lower triangular matrix with unit diagonal 
entries, Z is an upper triangular matrix, and their entries are independent of the right- 
hand-side vector b. Once Y and Z are built from random walks, given any b, one 
can apply equation (|3.3|) and find x' efficiently by a forward substitution. 

It is worth pointing out the physical meaning of the entries in matrix Y: the 
negative of an entry, (— Yi-^i), is asymptotically equal to the probability that a random 
walk from node k ends at node i, when Mk goes to infinity. Another property of matrix 
Y is that the sum of every row is zero, except for the first row where only the first 
entry is non-zero. 

From equation H3.3|) . we have 

(3.5) Z-^Yx = h 

Since the vector x' in the above equation is an approximate solution to the original 
set of equations Ax = b, it follows that^ 

(3.6) Z-^Y K A 

Because the inverse of an upper triangular matrix, Z^^, is also upper triangular, 
equation (|3.t)|l is in the form of an approximate "UL factorization" of A. The following 
definition and lemma present a simple relation between UL factorization and the more 
commonly encountered LU factorization. 

Definition 3.1. The operator rev(-) is defined on square matrices as follows: 
given matrix A of dimension N , rev(A) is also a square matrix of dimension N , such 
that 

rev(A)jj = AN+i-t,N+i-j, Vi,je {1,2, •••,A^} 



In simple terms, the operator rev(-) merely inverts the row and column ordering 
of a matrix. A simple example of applying this operator is as follows: 
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LEMMA 1. Let A — LU be the LU factorization of a square matrix A, then 
rev(A) = rev(L)rev(t/) is true and is the UL factorization o/rev(A). 

Lemma ^ is self-evident, and the proof is omitted. It states that the reverse- 
ordering of the LU factors of A are the UL factors of reverse-ordered A. 

Applying Lemma ^ on equation (|3.6() . we obtain 

(3.7) rev(Z-i)rev(y) « rev(A) 

SPor any vector b, we have (Z-iy)~^b = x' x = A'^b. Therefore, A(Z-iy)"^b ss 
b, and then — A (^Z~^Y^ ^ b 0. Since this is true for any vector b, it must be true for 

eigenvectors of the matrix — A(^Z~^Y^ ^, and it follows that the eigenvalues of the matrix 

— A (^Z^^Y^ ^ are all close to zero. Thus we claim that Z~^Y RJ A. 
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Since A is an R-matrix and is symmetric, rev(yl) must be also symmetric, and we can 
take the transpose of both sides, and have 

(3.8) {^ev{Y)f (rev(Z-i))^ « rev(A) 

The above equation has the form of a Doolittle LU factorization [7j: matrix (rev(F))"'" 
is lower triangular with unit diagonal entries; matrix (rev{Z~^)^'^ is upper triangular. 

LEMMA 2 . The Doolittle L U factorization of a square matrix is unique. 

The proof of Lemma [3 is omitted. Let the exact Doolittle LU factorization of 
rev{A) be rev{A) = ircv(A)C^rcv(A), and its exact LDL factorization be rev(A) = 

-^rcv(A)^rcv(A) (irov(yi) ) • Since H3.8() is an approximate Doolittle LU factorization of 
rev(A), while the exact Doolittle LU factorization is unique, it must be true that: 

(3.9) (rev(y))^ « L,,^^a) 

(3.10) (rev(Z"^))'^ sa [/rov(A) = Aov(A) (ircv(A))'^ 

The above two equations indicate that from the matrix Y built by random walks, 
we can obtain an approximation to factor ij.ev(A)i a-^d that the matrix Z contains 
redundant information. Section 13.31 shows how to estimate matrix D^cv{A) utilizing 
only the diagonal entries of matrix Z, and hence the rest of Z is not needed at all. 
According to equation H3.4(l . matrix Y is the award register in the journey record 
and keeps track of end nodes of random walks, while matrix Z is the motel-expense 
register and keeps track of all intermediate nodes of walks. Therefore matrix Z is 
the dominant portion of the journey record, and by removing all of its off-diagonal 
entries, the modified journey record is significantly smaller than that in the original 
bookkeeping technique from Section [2. 2. 21 In fact, an upper bound on the number of 
non-zero entries in matrix Y is proven in the next section. 

3.2. The Incomplete Non-zero Pattern. The previous section proves that 
an approximate factorization of an R-matrix A can be obtained by random walks. 
However, it does not constitute a proof of incomplete factorization, because an in- 
complete factorization implies that the non-zero pattern of the approximate factor 
must be a subset of the non-zero pattern of the exact factor. Such a proof is the task 
of this section: to prove that an entry of (rev(y))"'" can be possibly non-zero only if 
the corresponding entry of ircv(A) is non-zero. 

For i ^ j, the {i,j) entry of (rev(y))"'" is as follows, after applying Definition 13. II 
and equation H3.4|) . 



(3.11) ((rev(y))^) ^YN+i-j,N+i^r 



N+l-j,N+l-i 



This value is non-zero if and only if j < i and His!+i-j,N+i-i > 0. In other words, at 
least one random walk starts from node {N -t- 1 — j) and ends at node (TV +1 — i). 

To analyze the non-zero pattern of -Lrcv(yi)5 certain concepts from the literature 
of LU factorization are used here, and certain conclusions are cited without proof. 
More details can be found in [Jj [7] [HI QD] ■ Figure 13.11 illustrates one step in the 
exact Gaussian elimination^" of a matrix: removing one node from the matrix graph. 



'^"LU factorization of a matrix is a sequence of Gaussian elimination steps. From the perspective 
of the matrix graph, it is a sequence of graph operations that remove nodes one by one. 
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Fig. 3.1. One step in Gaussian elimination. 

and creating a clique among its neighbors. For example, when node vi is removed, a 
clique is formed for {w2, 1's, ^^4, ^5, wg}, where the new edges correspond to fills added 
to the remaining matrix. At the same time, five non-zero values are written into the 
L matrix, at the five entries that are the intersections^^ of node wi's corresponding 
column and the five rows that correspond to nodes {u2, W3, W4, ^5, t^g}- 

Definition 3.2. Given a graph G — {V,E), a node set S d V, and nodes 
wi,W2 S V such that Wi,f2 ^ S, node V2 is said to be reachable from node vi through 
S if there exists a path between vi and V2 such that all intermediate nodes, if any, 
belong to S . 

Definition 3.3. Given a graph G = {V,E), a node set S dV, a node vi ^ V 
such that vi ^ S, the reachable set of vi through S, denoted R{vi,S), is defined as: 

R (ui, S) — {v2 ^ S\v2 is reachable from vi through S} 

Note that if vi and V2 are adjacent, there is no intermediate node on the path 
between them, then Definition 13.21 is satisfied, and V2 is reachable from vi through 
any node set. Therefore, R{vi,S) always includes the direct neighbors of vi that do 
not belong to S. 

Given an R-matrix A, let G be its matrix graph, let L be the complete L factor 
in its exact LDL factorization, and let vi and V2 be two nodes in G. Note that every 
node in G has a corresponding row and a corresponding column in A and in L. The 
following lemma can be derived from JHI QS] ■ 

LEMMA 3. The entry in L at the intersection of column vi and row V2 is non-zero 
if and only if: 

1. vi is eliminated prior to V2 during Gaussian elimination 

2. V2 G R{vi, {nodes eliminated prior to vi}) 

We now apply this lemma on -/ji.ov(A)- Because the factorization of rev(j4) is 
performed in the reverse ordering, i.e., N,N — 1, • • • , 1, the (i, j) entry of Lrev(A) is 
the entry at the intersection of the column that corresponds to node {N + 1 — j) and 
the row that corresponds to node {N + 1 — i). This entry is non-zero if and only if 
both of the following conditions are met. 

1. Node {N + I ~ j) is eliminated prior to node {N + 1 — i) 

2. {N +l-i) e R{N +l-j,Sj) 

where Sj = {nodes eliminated prior to N + 1 — j} 
Again, because the Gaussian elimination is carried out in the reverse ordering 

^^In this section, rows and columns of a matrix are often identified by their corresponding nodes 
in the matrix graph, and matrix entries are often identified as intersections of rows and columns. The 
reason is that such references are independent of the matrix ordering, and thereby avoid confusion 
due to the two orderings involved in the discussion. 
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TV, — 1, • • • , 1, the first condition imphes that 



N+l-j>N+l-i 



j < i 



The node set Sj in the second condition is simply {N + 2 — j, iV + 3 — j, • • • , N}. 

Recall that equation H3.11|l is non-zero if there is at least one random walk that 
starts from node {N + 1 — j) and ends at node {N +1 — i). Also recall that according 
to Section 12.2.11 when random walks are performed from node (N + 1 — j), nodes 
{1, 2, ■ ■ ■ , N — j} are home nodes that walks terminate, while nodes Sj — {N + 2 — 
j, N + 3 — j, ■ ■ ■ , N} are the motel nodes that a walk can pass through. Therefore, a 
walk from node {N + I — j) can possibly end at node {N + 1 — i), only if {N + 1 — i) 
is reachable from {N + 1 — j) through the motel node set, i.e., node set Sj. 

By now it is proven that both conditions for ^ ■ to be non-zero are nec- 

essary conditions for equation H3.11|l to be non-zero. Therefore, the non-zero pattern 
of (rev(y)) is a subset of the non-zero pattern of irov(A)- Together, this conclusion 
and equation H3.9I) give rise to the following lemma. 

LEMMA 4. (rev(y)) is the L factor of an incomplete LDL factorization of matrix 
vev{A) . 

This lemma indicates that, from random walks, we can obtain an incomplete 
LDL factorization of the left-hand-side matrix A in its reversed index ordering. The 
remaining approximate diagonal matrix D is derived in the next section. 

3.3. The Diagonal Component. To evaluate the approximate D matrix, we 
take the transpose of both sides of equation (|3.10|l . and obtain 



LEMMA 5. For a non- singular square matrix A, TeY{A^^) — (rev (A)) 
The proof of this lemma is trivial and is omitted. Applying this lemma on equation 
(|3.12|l . we have 



nal entries, and that -Drov(yi) is a diagonal matrix. Therefore, the (j,i) diagonal entry 
in the above equation is simply 



(3.12) 



rev(Z ^) « ircv(A)Aov(A) 




(rev(Z))^^^ (ircv(A)),^, (A-cv(A)),^, ~ 1 

(rev(Z)),_, • 1 • (A.cv(A)),,, « 1 



(3.14) 




Applying Definition l3 . H and equation (|3.4|) , we finally have the equation for computing 
the approximate D factor, given as follows. 
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It is worth pointing out the physical meaning of the quantity ' . It is 

the average number of times that a walk from node N + 1 — i passes node N + 1 — i 
itself; in other words, it is the average number of times that the walker returns to 
his/her starting point before the game is over. Equation H3.15|l indicates that an entry 
in the D factor is equal to the corresponding diagonal entry of the original matrix A 
divided by the expected number of returns. 

4. The Hybrid Solver and its Comparison with ILU. In this section, the 
proposed hybrid solver for R-matrices is presented in its entirety, and we argue that 
it outperforms traditional ICCG methods. 

Definition 4.1. The operator rev {■) is defined on vectors as follows: given vector 
X of length N , rev(x) is also a vector of length N, such that rev(x)i — XAr_|_i_j,V« G 
{1,2,. ••,7V}. 

It is easy to verify that the set of equations Ax = b is equivalent to 

rev (A) rev (x) — rev(b) 

By now, we have collected the necessary pieces of the proposed hybrid solver, and it 
is summarized in the pseudocode in Algorithm ^ 

Algorithm 1. The final hybrid solver for R-matrices: 
Precondition { 

Run random walks, build matrix Y and find diagonal 

entries of Z using equation ( 13 .41 ) ; 
Build Lrcv(A) using equation ( 13 . 91 ) ; 
Build -Di.cv(A) using equation ( 13 . 151 ) ; 

} 

Given b, solve { 

Convert Ax = b to rev(A)rev(x) = rev(b) ; 
Apply PCG on rev(A)rev(x) — rev(b) with the 

preconditioner ^I/rev(yl)-Drev{A) (irev{A)) ^ ; 
Convert rcv(x) to x; 
J 

The proposed hybrid solver essentially replaces the preconditioner in existing 
ICCG methods with the incomplete LDL factorization produced by random walks. 
We claim that this new preconditioner has better quality than the incomplete Cholesky 
factor B produced by traditional incomplete factorization approaches. In other words, 
if matrices Y and B have the same number of non-zero entries, and given the same 
target accuracy requirement, we expect the hybrid solver to converge with fewer 
iterations than a traditional ICCG solver preconditioned by [BB^^ 

The argument is based on the fact that, in traditional Gaussian-elimination-based 
methods, the operations of eliminating different nodes are correlated and the error 
introduced at an earlier node gets propagated to a later node, while in random walks, 
the operation on a node is totally independent from other nodes. We now state this 
in detail and more precisely. 

Let us use the ILUT approach as an example of traditional preconditioning meth- 
ods; similar argument can be made for other existing techniques, as long as they are 
based on Gaussian elimination. Suppose in Figure rTTl when eliminating node vi, the 
new edge between nodes V2 and W3 corresponds to an entry whose value falls below 
a specified threshold, then ILUT drops that entry from the remaining matrix, and 
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that edge is removed from the remaining matrix graph. Later when the algorithm 
reaches the stage of ehminating node W2, because of that missing edge, no edge is 
created from V3 to the neighbors of V2, and thus more edges are missing, and this 
new set of missing edges then affect later computations accordingly. Therefore, an 
early decision of dropping an entry is propagated throughout the ILUT process. On 
the one hand, this leads to the sparsity of B, which is desirable; on the other hand, 
there is no control over error accumulation, and later columns of B can deviate from 
the exact Cholesky factor by an amount that is greater than the planned threshold of 
ILUT. Such error accumulation gets exacerbated for larger and denser matrices. 

The hybrid solver does not suffer from this problem. When we run random walks 
from node k and collect the Hk.i values to build the fc*^ row of matrix Y according to 
equation (|3.4(l . we only know that the nodes {1, 2, • • • , k— 1} are homes, and this is the 
only information needed. If, for some reason, the computed k^^ row of matrix Y is of 
lower quality, this error does not affect other rows in any way; each row is responsible 
for its own accuracy, according to a criterion to be discussed in Section [5. II In fact, in 
a parallel computing environment, the computation of each row of Y can be assigned 
to a different processor. 

It is worth pointing out that the error accumulation discussed here is different 
from the cost of bias discussed at the end of Section l!^.2.1l That bias in the stochastic 
solver, in the context of the hybrid solver, maps to the forward/backward substitu- 
tion, i.e., the procedure of applying the preconditioner inside PCG. Due to the fact 
that forward/backward substitution is a sequential process, such bias or error prop- 
agation is inevitable in all iterative solvers as long as an implicit factorization-based 
multiplicative preconditioner is in use. Our claim here is that the hybrid solver is 
free of error accumulation in building the preconditioner, and not in applying the 
preconditioner^^ . 

In summary, because of the absence of error accumulation in building the precon- 
ditioner, we expect the hybrid solver to outperform traditional ICCG methods, and 
we expect that the advantage becomes more prominent for larger and denser matrices. 

5. Implementation Issues. This section describes several implementation as- 
pects of the proposed preconditioning technique. The goal is twofold: 

• To minimize the runtime of building the preconditioner. In other words, the 
computation given in the first part of Algorithm ^ should be performed with 
the fewest random walks. 

• To achieve a better accuracy-size tradeoff. That is cither to improve the 
accuracy of the preconditioner without increasing the number of non-zero 
entries, or to reduce the number of non-zeroes without losing accuracy^^. 

5.1. Stopping Criterion. The topic of this section is the accuracy control of 
the preconditioner, that is, how should one choose Mj^, the number of walks from 
node fc, to achieve a certain accuracy level in estimating its corresponding entries in 
the LDL factorization. In Section ITTl the stopping criterion in the stochastic solver 
is chosen to be an error margin and a confidence level defined on the result of a walk; 
it is not applicable to the hybrid solver because here it is necessary for the criterion 

After a row of matrix Y is calculated, it is possible to add a postprocessing step to drop insignif- 
icant entries. The criterion can be any of the strategies used in traditional incomplete factorization 
methods, and, as discussed in Section may be based on pattern, threshold, size limits, or a com- 
bination of them. With such postprocessing, the hybrid solver still maintains the advantage of 
independence between row calculations. This is not included in our implementation. 
^^Again, the discussion of preconditioner accuracy is mostly qualitative. 
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to be independent of the right-hand-side vector b. In our implementation, a new 
stopping criterion is defined on a value that is a function of only the left-hand-side 
matrix A, as follows. Let Ek = E [length of a walk from node k], and let SJ. be the 
average length of the walks. The stopping criterion is chosen as 

(5.1) P[-A < =*=_-r^i < A] > a 

where A is a relative error margin, and a is a confidence level, for example a = 99%. 
Practically, this criterion is checked by the following inequality: 



(5.2) 

where ak is the standard deviation of the lengths of the Mk walks, and Q is the 
standard normal complementary cumulative distribution function. Thus, Mk is de- 
termined dynamically, and random walks are run from node k until condition (|5.1|l 
is satisfied. In practice, it is also necessary to impose a lower bound on Mk, e.g., 20 
walks. 

Note that this is not the only way to design the stopping criterion: it can also be 
defined on quantities other than (for example, the expected number of returns), 
as long as this quantity does not depend on b. 

5.2. Exact Computations for One-step Walks. The implementation tech- 
nique in this section is a special treatment for the random walks with length 1, which 
we refer to as one-step walks. Such a walk occurs when an immediate neighbor of 
the starting node is a home node, and the first step of the walks happens to go there. 
The idea is to place stochastic computations performed by one-step walks with their 
deterministic limits. 

Without loss of generality, assume that the node ordering in the hybrid solver 
is the natural ordering 1,2, • • • ,iV. Let us consider the Mk walks from node fc, and 
suppose at least one of its immediate neighboring nodes is a home node, which could be 
either an initial home node if the fc'^ row of matrix A is strictly diagonally dominant, 
or a node j such that j < k. Among the walks, let Mj, i be the number of one-step 
walks, and let Hk^i^i be the number of one-step walks that go to node i, where node 
i is an arbitrary node such that i < k. For the case that node i is not adjacent to 
node k, i?/c,i,i is simply zero. For the case that node i is adjacent to node k, note 
that Hk,i,i may not be equal to Mk.i, as there can be other immediate neighbors of 
k that are home nodes. The Yk.i formula in l|^14|l can be rewritten as 

" ~ ~~ Mk ~\ Wk ) ■ ^ Mk - Mk,i 

Applying the mapping between transition probabilities and matrix entries in equa- 
tion H2.7|l . the following equations can be derived. 

lim — = P [first step goes to node i\ 



(5.4) 



A 



k.i 



A 



k,k 



lim — — — P [first step goes to a non-absorbing node! 

Mfc^oo Mk I f b b J 
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(5.5) 



P [first step goes to node j] 

j>k 

^j>k 



ife,fe 



We modify equation H5.H|I by replacing the term and the term jj^'''^ with 

their limits given by the above two equations, and obtain the following new formula 
for evaluating Yk^i. 

Y. — ^'^'^ ' f^j>k^'^d\ ( Hk,i — Hk,i,l 



Ak,k V Ak,k J \ Mk- Mk- 

The remaining stochastic part of this new equation is the term ^mI-ai^'i^ ' which 
can be evaluated by considering only random walks whose length is at least two; 
in other words, one-step walks are ignored. In implementation, this can be realized 
by simulating the first step of walks by randomly picking one of the non-absorbing 
neighbors of node k; note that then the number of random walks would automatically 
be {Mk — Mk^i), and no adjustment is needed. 

With a similar derivation, the Zk.k formula^** in H3.4|l can be modified to 



(5.7) Zk 



k,k 



1 Xlj>fc^fc,i (J2j>k^'^d\ (Jk,k~Jk 



Ak,k Al ,^ I ^2 / \Mk- Mk,- 



.k.l 



where Jk,k,i is the number of times that one-step walks pass node k. Obviously 
Jk k I = Mk 1 , and therefore 



1 , ( T,j>k ^fc J ^ Jk,k - M, 



k,l 



The remaining stochastic part of this new equation, the term "^^j^Zj^j^'^ , again can be 
evaluated by considering only random walks with length being at least two. Practi- 
cally, such computation is concurrent with evaluating Yk/s based on equation H5.6|l . 
The benefit of replacing (|3.4I) with equations H5.6|l and H5.8|l is twofold: 

• Part of the evaluation of Yk,i and Zk,k entries is converted from stochastic 
computation to its deterministic limit, and the accuracy is potentially im- 
proved. For the case when all neighbors of node k have lower indices, i.e., 
when all neighbors are home nodes, equations (|5.()|l and (|5.8|) become exact: 
they translate to the exact values of the corresponding entries in the complete 
LDL factorization. 

• By avoiding simulating one-step walks, the amount of computation in building 
the preconditioner is reduced. For the case when all neighbors of node k are 
home nodes, the stochastic parts of (|5.6() and (|5.8() disappear, and hence no 
walks are needed. 

5.3. Reusing Walks. Without loss of generality, assume that the node ordering 
in the hybrid solver is the natural ordering 1, 2, • • • , TV. A sampled random walk is 
completely specified by the node indices along the way, and hence can be viewed as a 



* Recall that we only need diagonal entries of matrix Z. 
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sequence of integers {fci, fc2, • ■ • , /cp}, such that ki > fcp, that ki < ki,yi Cz {2, • • • , F — 
1}, and that an edge exists between node fc; and node fc;+i, VZ G {l,---,r — 1}. 
If a sequence of integers satisfy the above requirements, it is referred to as a legal 
sequence, and can be mapped to an actual random walk. 

Due to the fact that a segment of a legal sequence may also be a legal sequence, 
it is possible to extract multiple legal sequences from a single simulated random 
walk, and use them also as random walks in the evaluation of equation (|3.4|l or its 
placement, (|5.6|l and H5.8|l . However, there are rules that one must comply with when 
extracting these legal sequences. A fundamental premise is that random samples must 
be independent of each other. If two walks share a segment, they become correlated. 
Note that if two walks have different starting nodes, they never participate in the same 
equation (|5.6|) or H5.8|) . and hence are allowed to share segments; if two walks have 
the same starting nodes, however, they are prohibited from overlapping. Moreover, 
due to the technique in the previous section, any one-step walk should be ignored. 

(a) {2,4,6,4,5,7,6,3,2,5,8,1} 

(b) {4,6,4,5,7,6,3} 
{5,7,6,3} 
{5,8,1} 

Fig. 5.1. An example of (a) the legal sequence of a simulated random walk and (b) three extra 
walks extracted from it. 

Figure ISTI shows an example of extracting multiple legal sequences from a single 
simulated random walk. The sequence {2, 5, 8, 1} cannot be used because it has the 
same starting node as the entire sequence; the sequence {4, 5, 7, 6, 3} cannot be used 
because it has the same starting node as {4,6,4,5,7,6,3} and the two sequences 
overlap^^. On the other hand, {5,7,6,3} and {5,8,1} are both extracted because 
they do not overlap and hence are two independent random walks. 

Considering all of the above requirements, the procedure is shown in Algorithm|21 
where the extracted legal sequences are directly accounted for in the M, H and J 
accumulators, which are defined the same as in all equations in this paper. Note that 
the simulated random walk is never stored in memory, and the only extra storage due 
to this technique is the stacks, which contain a monotonically increasing sequence of 
integers at any moment. 

This technique reduces the preconditioning runtime by fully utilizing the informa- 
tion contained in a single simulated random walk, such that it contributes to equations 
1)5.6(1 and H5.8|l as multiple random walks. It also guarantees that no two overlapping 
walks have the same starting node, and hence does not hurt the accuracy of the pro- 
duced preconditioner. The only cost of this technique is that the node ordering of 
the hybrid solver must be determined beforehand, and hence pivoting is not allowed 
during the incomplete factorization^^. 

^^It is also legitimate to extract {4,5,7,6,3} instead of {4,6,4,5,7,6,3}. However, the premise 
of random sampling must be fulfilled: the decision of whether to start a sequence with A:2 = 4 must 
be made without the knowledge of numbers after k2 , and the decision of whether to start a sequence 
with ki = 4 must be made without the knowledge of numbers after fc4. The strategy in Algorithml2lis 
to start a sequence as early as possible, and hence produces {4, 6, 4, 5, 7, 6, 3} instead of {4, 5, 7, 6, 3}. 

^^For R-matrices, or in general for diagonally dominant matrices, pivoting is not needed. For more 
general matrices to be discussed in Section |7] however, the usage of this technique may be limited. 
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Algorithm 2 . Extract multiple random walks from a single simulation: 
stackl .push( ki ); 
stack2 .push( 1 ) ; 

For l — 2,3,---, until the end of walk, do { 
While ( ki < stackl. top () ){ 
If( l> stack2.top()+l ){ 
k' = stackl . topO ; 

Mfc, = A/fe, + 1 ; 
Hy ,ki = Hy + 1 ; 

Jk' .k' = Jk' .k> + 1 ; 

} 

stackl .popO ; 
stack2 .popO ; 

} 

If( ki> stackl. top () ){ 
stackl. push( ki ); 
stack2.push( I ) ; 

} 

else Jki,ki ~ Jki,ki + 1; 
} 



5.4. Matrix Ordering. In existing factorization-based preconditioning tech- 
niques, matrix ordering can affect the performance, i.e., the accuracy-size tradeoff, 
of the preconditioner. The same statement is true for the proposed stochastic pre- 
conditioner. In general, we perform an incomplete LDL factorization of the reverse 
ordering of matrix A, we can apply any existing ordering method on A, reverse the 
ordering that it produces, and then use the resulting ordering. In this way, any benefit 
of that ordering method can be inherited by us. The following are a few examples of 
practical ordering schemes for the stochastic preconditioning. 

• Approximate minimum degree ordering (AMD) from is one of the state- 
of-the-art ordering techniques to reduce the number of non-zero entries in 
a complete LU factorization, or LDL factorization for an R-matrix. Since 
the complete L factor has a smaller size, it is likely that with the same size, 
the incomplete L factor may have better quality. Therefore, using a reversed 
AMD ordering may improve the accuracy-size tradeoff. 

• Reverse Cuthill-McKee ordering (RCM) from [S] is a simple but useful or- 
dering technique to reduce the bandwidth of both the original matrix A and 
the complete LU factors, and thereby improve cache efficiency. The physical 
CPU time of applying the LU factors on a particular right-hand-side vector is 
reduced due to less cache misses. For the hybrid solver, this means that, with 
the same preconditioner size, the actual CPU time of applying the precon- 
ditioner may be reduced. Of course, the ordering to use should be reversed 
RCM, which becomes the original Cuthill-McKee ordering. 

• Random ordering is used in our implementation. With random ordering, 
home nodes are relatively evenly distributed at all stages of the game, and 
for walks from any node, the most viable home nodes are of similar distances. 
Empirically, we have observed a stable performance. 

6. Numerical Results. To evaluate the proposed stochastic preconditioner, a 
set of benchmark matrices are generated by SPARSKIT |^ by finite-difference dis- 
cretization of the 3D Laplace's equation V^ii = with Dirichlet boundary condition. 
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The matrices correspond to 3D grids with sizes 50-by-50-by-50, 60-by-60-by-60, up 
to 100-by-lOO-by-lOO, and a right-hand-side vector with all entries being 1 is used 
with each of them. They are Usted in Table |0] as benchmarks ml to m6. Another 
four application-specific benchmarks are reported in Table 16.21 they are placement 
matrices from VLSI design, and are denser than the 3D-grid matrices. 

Table 6.1 

Computational complexity comparison of the hybrid solver, ICCG with ILU(O) (LASPACK), 
and ICCG with ILUT (MATLAB), to solve for one right-hand- side vector, for the 3D-grid bench- 
marks, with 10~® error tolerance. N is the dimension of a matrix; E is the number of non-zero 
entries of a matrix; C is the number of non-zero entries of the Cholesky factor; Ml is the number 
of multiplications per iteration; I is the number of iterations to reach 10~® error tolerance; M2 is 
the total number of multiplications; Rl is the speedup ratio of the hybrid solver over ICCG with 
ILU(O); R2 is the speedup ratio of the hybrid solver over ICCG with ILUT. 



Matrix 


N 


E 


ICCG with ILU(O) 


ICCG with ILUT 


Hybrid 


Rl 


R2 


C 


Ml 


/ 


M2 


C 


Ml 


I 


M2 


C 


Ml 


/ 


M2 






ml 


1.3c5 


8.6c5 


4.9e5 


2.3e6 


41 


9.6e7 


1.7e6 


4.8e6 


21 


1.0c8 


1.6c6 


4.5c6 


18 


8.1e7 


1.19 


1.25 


m2 


2.2c5 


1.5e6 


8.5e5 


4.1e6 


48 


1.9e8 


3.0e6 


8.4e6 


25 


2.1e8 


2.8c6 


7.9e6 


19 


1.5c8 


1.30 


1.40 


m3 


3.4e5 


2.4e6 


1.4e6 


6.5e6 


56 


3.6e8 


4.8e6 


1.3e7 


29 


3.9e8 


4.4e6 


1.3e7 


19 


2.4e8 


1.51 


1.61 


m4 


5.1e5 


3.5e6 


2.0e6 


9.7e6 


63 


6.1e8 


7.2e6 


2.0e7 


32 


6.4e8 


6.7e6 


1.9e7 


19 


3.6e8 


1.68 


1.77 


m5 


7.3e5 


5.1e6 


2.9e6 


1.4e7 


71 


9.8c8 


1.0c7 


2.9c7 


36 


1.0e9 


9.6e6 


2.7c7 


20 


5.5c8 


1.79 


1.88 


m6 


1.0e6 


6.9e6 


4.0e6 


1.9c7 


79 


1.5e9 


1.4e7 


3.9e7 


40 


1.6e9 


1.3e7 


3.8c7 


20 


7.5e8 


1.99 


2.09 



Table 6.2 

Computational complexity comparison of the hybrid solver, ICCG with ILU(O) (LASPACK), 
and ICCG with ILUT (MATLAB) , to solve for one right-hand- side vector, for the VLSI placement 
benchmarks, with le-6 error tolerance. N , E, C, Ml, I, M2, Rl and R2 are as defined in Table 



Matrix 


N 


E 


ICCG with ILU(O) 


ICCG with ILUT 


Hybrid 


Rl 


R2 


C 


Ml 


/ 


M2 


C 


Ml 


/ 


M2 


a 


A/1 


/ 


A/2 






ml 


4.3e5 


5.2e6 


2.8e6 


1.3c7 


122 


1.5o9 


6.5o6 


2.0c7 


62 


1.2o9 


6.5o6 


2.0o7 


12 


2.3e8 


6.6 


5.3 


m8 


3.5e5 


5.5e6 


2.9e6 


1.3c7 


82 


1.0c9 


5.1c6 


1.7c7 


27 


4.6c8 


5.0c6 


1.6c7 


12 


2.0e8 


5.3 


2.4 


m9 


4.6c5 


8.2e6 


4.3e6 


1.9e7 


110 


2.1e9 


7.5o6 


2.5o7 


55 


1.4o9 


8.0c6 


2.5o7 


13 


3.3e8 


6.3 


4.2 


mlO 


8.8e5 


9.4e6 


5.2e6 


2.3e7 


159 


3.7c9 


1.3c7 


3.9e7 


82 


3.2c9 


1.2c7 


3.7c7 


12 


4.4e8 


8.4 


7.1 



Table 6.3 

Physical runtimes of the hybrid solver on a Linux workstation with 2.8GHz CPU frequency. Tl 
is preconditioning CPU time. T2 is solving CPU time with le-6 error tolerance. Unit is second. 



Ckt 


ml 


m2 


m3 


m4 


m5 


m6 


m7 


m8 


m9 


mlO 


Tl 


5.38 


10.39 


17.97 


28.78 


44.01 


71.57 


33.00 


21.67 


46.91 


68.90 


T2 


2.52 


5.80 


10.59 


17.50 


28.61 


41.07 


11.90 


9.73 


17.07 


26.09 



In Table l?m and Table IfT^ we compare the proposed hybrid solver against ICCG 
with ILU(O) and ICCG with ILUT. The complexity metric is the number of double- 
precision multiplications needed at the iterative solving stage for the equation set 
Ax = b, in order to converge with an error tolerance of 10^^. This error tolerance is 
defined as: 

(6.1) 11 b-^x II2 < 10-6. II b II2 

LASPack is used for ICCG with ILU(O), and MATLAB is used for ICCG with 
ILUT. There are three node ordering algorithms available in MATLAB: minimum 
degree ordering (MMD) approximate minimum degree ordering (AMD) T, and 
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reverse Cuthill-McKee ordering (RCM) 0. AMD results in the best performance 
on the benchmarks and is used for all tests. The dropping threshold of ILUT in 
MATLAB is tuned, and the accuracy-size tradeoff of the proposed preconditioner is 
adjusted, such that the sizes of the Cholesky factors produced by both methods are 
similar, i.e., the C values in the tables are close. For LASPack and MATLAB, the 
Ml values are computed using the following equation. 

(6.2) Ml = C- 2 + E + N- 4: 

According to the PCG pseudo codes in and '53', the above equation is the best 
possible implementation. The Ml values of the hybrid solver is obtained by a detailed 
count embedded in its implementation, and in fact equation H6.2|l is roughly true for 
the hybrid solver as well. 

A clear trend can be observed in Table 16.11 and Table 16.21 that the larger and 
denser a matrix is, the more the hybrid solver outperforms ICCG. This is consistent 
with our argument in Sectional when the matrix is larger and denser, the effect of 
error accumulation in traditional methods becomes stronger. 

The physical runtimes are shown in Table 16.31 Admittedly, the preconditioning 
runtime Tl is more than the typical runtime of a traditional incomplete factorization; 
however, it is not a large overhead, gets easily amortized over multiple re-solves, and 
is worthwhile given the speedup achieved in the solving stage. 

A reference implementation of the stochastic preconditioning for R-matrices, as 
well as the hybrid solver, is available to the public 22 . 

7. Extensions. So far the discussion has been limited to R-matrices. This sec- 
tion presents techniques aimed at extending the theory to more general matrices, and 
speculates on potential challenges in future research on this topic. 

7.1. Asymmetric A Matrices. Let us first remove the symmetry requirement 
on matrix A. Recall that the construction of the random walk game and the derivation 
of equation l|3.7|l does not require A to be symmetric. Therefore, matrices Y and Z 
can still be obtained from random walks, and equation (|3.7II remains true for an 
asymmetric matrix A. Suppose rev(A) = -^rcv(yi)^rcv(A)CA-ov(yi)i where ircv(A) is a 
lower triangular matrix with unit diagonal entries, J7i.ev(A) is an upper triangular 
matrix with unit diagonal entries, and I?rcv(A) is a diagonal matrix. This is called the 
LDU factorization [7] , which is a slight variation of the LU factorization, and it is easy 
to show, based on Lemma |21 that the LDU factorization is also unique. Substituting 
the factorization into equation H3.7() . we have 



(7.1) rev(Z )rev(F) sa L 

Based on the uniqueness of LDU factorization, it must be true that 

(7.2) rev(r) « C/„v(A) 

(7.3) rev(Z"^) « A'ov(A)-Di.ov(A) 



By equation 1)7. 2|l . we can approximate C/rcv(A) based on Y; by equation (|7.3(l . and 
through the same derivation as in Section 13.31 we can approximate -Dicv(A) based on 
the diagonal entries of Z. The remaining question is how to obtain Lrov(A)- 

Suppose we construct a random walk game based on A"^ instead of A, and suppose 
we obtain matrices Y^t and Z^t based on equation (|3.4|l . Then according to equation 
(|7.2ll . we have 

(7.4) rev(y4T) w C/rcv(AT) 
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where J7icv(AT) is the U factor in the LDU factorization of iev{A^). It is easy to 
derive the foUowing 

(7.5) iev{A^) = (rev(A))'^ = {Ure^,(A))'^ D^ev{A) (-^rov(A))'^ 

Therefore, 

(7-6) ircv(AT)-Di.cv(AT)f/rev(AT) = {U^-cv{A)) A-ov(A) (ircv(A)) 

Based on the uniqueness of the LDU factorization, it must be true that 

T 

(7-7) (ircv(A)) = ^^rcv(AT) 

By (|Tl|l and ((77|l . we finally have 

(7.8) rev(YAT) w (£rcv(A))'^ 

In other words, we can approximate ircv(A) based on Yj^t. 

In summary, when matrix A is asymmetric, we need to construct two random 
walk games for A and A"^ , and then based on the two Y matrices and the diagonal 
entries of one of the Z matrices^^, we can approximate the LDU factorization of 
rev(A) based on equations (|3.15l) . (|7.2I) . and 1)7. 8|l . The proof of non-zero pattern is 
similar to Section 13.21 and with the same conclusion: the non-zero patterns of the 
resulting approximate L and U factors are subsets of those of the exact factors. Both 
the time complexity and space complexity of preconditioning become roughly twice 
those of the symmetric case: this is the same behavior as a traditional ILU. 

7.2. Random Walk Game with Scaling. By now, the symmetry restriction 
on matrix A has been removed, and the remaining requirements on A are: the diagonal 
entries must be positive; the off-diagonal entries must be negative or zero; A must 
irreducibly diagonally dominant, both row- wise and column-wise. 

m-, m, Wj. »7^„,,^j 

Fig. 7.1. A random walk in the modified game with scaling. 

To remove these constraints, a new game is designed by defining a scaling factor^® 
s on each direction of every edge in the original game from Section f2.1l Such a scaling 
factor becomes effective when a random walk passes that particular edge in that 
particular direction, and remains effective until this random walk ends. Let us look 
at the stochastic solver first. A walk is shown in Figure ITtI it passes a number of 
motels, each of which has its price mi, I {1, 2, • • • , F}, and ends at a home node with 
certain award value rriaward- The monetary gain of this walk is defined as follows. 

r-i r 

(7.9) gain = -mi - Sim2 - SiS2m3 J]^ • mr + ]^ s; • rriaward 

1=1 1=1 

^^Duc to the uniqueness of the LDU faetorization, it does not matter the diagonals of which Z 
are used. 

^*A similar concept of scaling factors can be found in ,2,, though tailored to its specific game 
design. 
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In simple terms, this new game is different from tire original game in that each trans- 
action amount during the walk is scaled by the product of the currently active scaling 
factors. Define the expected gain function / to be the same as in equation (|2.2|) . and 
it is easy to derive the replacement of equation H2.4|l : 



where Si_; denotes the scaling factor associated with the direction i ^ / of the edge 
between i and I, and the rest of the symbols are the same as defined in (|2.4() . 

Due to the degrees of freedom introduced by the scaling factors, the allowable 
left-hand-side matrix A is now any matrix with non-zero diagonal entries. In other 
words, given any matrix A with non-zero diagonal entries, a random walk game with 
scaling can be constructed such that the / values, if they uniquely exist, satisfy a set 
of linear equations where the left-hand-side matrix is A. 

A corresponding stochastic preconditioning method can be derived based on this 
new random walk game, by redefining the H and J values in equations H3.4|l . (|5.()|l . 
and (|5.8|l to be the sum of products of scaling factors. 

If every scaling factor in the game has an absolute value less or equal to 1, there 
is no numerical problem in the above new preconditioning procedure. This can be 
achieved as long as matrix A is diagonally dominant, in which case we can sim- 
ply assign scaling factors to be -1-1 or —1, or, if matrix A is complex- valued, assign 
complex-valued scaling factors with unit magnitude. If there exist scaling factors with 
absolute values over 1, however, numerical problems may potentially occur since the 
product of scaling factors may be unbounded. How to quantify this effect and to 
analyze the corresponding convergence rate, is an open question for future research. 

Therefore, the conclusion of this section is as follows. 
• If the left-hand-side matrix A is irreducibly diagonally dominant both row- 
wise and column-wise, the generalized stochastic preconditioning technique is 
guaranteed to work, and according to the argument in Section^] the resulting 
preconditioner is expected to outperform traditional incomplete factorization 



• If the left-hand-side matrix A is not diagonally dominant, as long as its diago- 
nal entries are non-zero, a random walk game exists such that the / values, if 
they uniquely exist, satisfy a set of linear equations where the left-hand-side 
matrix is A. However, no claim is made about the quality of the resulting 
preconditioner, and this is open for further investigation. 
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